Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing (Supplementary Material)

Neural Information Processing Systems

However, the number of learned group tokens in GroupViT is a hyper-parameter and there is no constraint on it. The text embeddings are used in a contrastive loss to match the global visual representations.

Figure 1: Comparison of recall for all 25 classes between HAN [2] and the proposed MGN in terms of event-level audio, visual, and audio-visual metrics, i.e., Event_A, Event_V, and Event_AV.